Deceptive behaviors are difficult to stop once an AI model learns them

Researchers from Anthropic, an OpenAI rival, discovered that it’s hard to stop artificial intelligence deception.

Researchers from Anthropic, an artificial intelligence startup backed by Amazon, recently published a paper in which they showed that once an AI model learns deceptive behaviors, it can be very challenging to stop them.

The paper the researchers co-authored showed that artificial intelligence can be trained to behave deceptively.

Beyond that, the researchers also concluded that once an AI model is trained to exhibit deceptive behaviors, it’s hard to get rid of them. In fact, they stated that standard safety training techniques may “fail to remove such deception,” and could in fact “create a false impression of safety.” That is, by attempting to overcome the taught deceptive behaviors, it could just make the artificial intelligence even better at deceiving people.

AI Anthropic on mobile phone screen — Credit: Photo by depositphotos.com

The researchers based their conclusions on the training of models based on Claude, Anthropic’s chatbot. They trained them to behave unsafely when certain triggers were used. For instance, they trained the models to produce secure code when the year 2023 was used as a prompt, but to insert vulnerabilities into the code if the prompt contained the year 2024.

An AI model can go on to behave deceptively and in fact have those behaviors reinforced through correction.

The researchers determined that deceptive behavior became too intrenched into the AI model to simply “train away” by using conventional standard safety training techniques. Among those strategies is one called adversarial training. That technique triggers the unwanted behavior and then penalizes it. That said, using this method can teach the artificial intelligence to improve itself at hiding its deceptive behaviors.

“This would potentially call into question any approach that relies on eliciting and then disincentivizing deceptive behavior,” wrote the paper’s authors.

Though this statement is rather concerning, according to the researchers, they aren’t worried about the likelihood of artificial intelligence “naturally” developing the deceptive behaviors.

Athropic has claimed that it has made AI model safety a top priority since the company was first founded by a team of former OpenAI staff members. Among them was Dario Amodei, who said that the primary reason he left AI was to develop a safer form of artificial intelligence.

Researchers from Anthropic, an OpenAI rival, discovered that it’s hard to stop artificial intelligence deception.

The paper the researchers co-authored showed that artificial intelligence can be trained to behave deceptively.

An AI model can go on to behave deceptively and in fact have those behaviors reinforced through correction.

Leave a Comment Cancel reply